What is Attribution Modeling
Before jumping into the data analysis, let's first set the context for the problem that Markov Chain Attribution modeling addresses.
Many Advertising Channels
When marketers spend money on advertising, they strategically allocate budget towards different campaigns, which in turn use a multitude of advertising channels. Some common advertising channels include cable and streaming TV, Programmatic, Social, Paid Search, and Print. How these are defined, and which media partners and platforms are used to spend the planned budget, varies, but the fact that advertisers communicate through a multitude of channels remains.
Goals and KPIs
Let’s keep in mind that marketers don’t advertise for the sake of advertising. They have particular goals and objectives around which a marketing communications strategy is built. One of the roles of Analytics is to help build a measurement plan that defines a set of Key Performance Indicators (KPIs) that serve as a measure of success for the marketing efforts. Depending on the goals and technical capabilities, these KPIs are chosen and tools for measuring them are put in place. For example, we may want to generate visits to a website or a webpage, generate leads through form submissions, have users perform certain actions on our website, raise awareness for our brand, or simply drive sales. Whatever the KPI is, a unit of a completed KPI is referred to as a Conversion.
ROI and Attribution
While we are running a campaign and spending our budget, how do we evaluate the degree of importance of the different channels in driving our KPI? Obviously, if our goal was to drive sales but, all else being equal, our sales did not increase, then none of the channels worked well. But if we do succeed in driving incremental sales, which channel is working best? In other words, how do we Attribute the incremental sales to different channels and compute our Return on Investment (ROI, sometimes referred to as Return on Advertising Spend, ROAS)? Should we tactically alter our plan because one of the channels is not performing as expected? Answering these questions is one way Analytics adds value to Advertising.
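As a quick refresher on the arithmetic (with made-up numbers purely for illustration), ROAS is revenue divided by spend, while ROI is usually stated net of spend:

```r
# Hypothetical channel-level spend and attributed revenue, for illustration only
spend   = c(Display = 50000, TV = 120000, Search = 30000)
revenue = c(Display = 80000, TV = 150000, Search = 70000)

roas = revenue / spend            # return on advertising spend
roi  = (revenue - spend) / spend  # return on investment, net of spend
round(rbind(roas, roi), 2)
```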
Last-Touch in the Consumer Path
One way to answer this question comes about as a result of digital advertising and consumer-level data. Digital advertising can often be cheaper and more easily measured. This trackability allows us to collect data and analyze parts of the Consumer Path that, over time, lead to a Conversion. A consumer journey exists whether or not it’s tracked, and no one can ever really track the whole journey. What we can track are some of the Media Touchpoints. Often the easiest way to Attribute Conversions is to simply look at the Last-Touch in the Consumer Path, because often that is the only touchpoint we do see. However, giving the Last-Touch full credit for driving a Conversion results in an unfair assessment of our media effectiveness and does not account for the Synergy that a mix of channels aims to create.
library(visNetwork)
library(tidyverse)

nodes <- data.frame(id = 1:4,
                    label = c("Display", "TV", "Search 100%", "Conversion"),
                    color = c("grey", "grey", "lightblue", "green"),
                    shape = rep("square", 4))
edges = data.frame(from = 2:4, to = 1:3)

visNetwork(nodes, edges, width = "100%") %>%
  visEdges(arrows = "from") %>%
  visIgraphLayout(layout = "layout_on_grid")
Thankfully, if an adequate effort to track our media is made, we can generate Path Data and attribute in a more sophisticated way, that is, using Multi-Touch Attribution (MTA). This allows us to better judge the effectiveness of the channels used and make better investment decisions.
nodes <- data.frame(id = 1:4,
                    label = c("Display 25%", "TV 35%", "Search 40%", "Conversion"),
                    color = c("lightblue", "lightblue", "lightblue", "green"),
                    shape = rep("square", 4))
edges = data.frame(from = 2:4, to = 1:3)

visNetwork(nodes, edges, width = "100%") %>%
  visEdges(arrows = "from") %>%
  visIgraphLayout(layout = "layout_on_grid")
Multi-Touch Attribution
There are two main categories of MTA methods: Rule-Based and Data-Driven. Of course, both of these use data, but what distinguishes them is how they assign importance to touch-points along the consumer path. Rule-Based methods heuristically assign weights to touch-points based on their position. You can find more info about these here.
Data-Driven methods take a different approach. Instead of assigning arbitrary rules based on position, they establish probabilistic relationships between touchpoints and their role in driving Conversions.
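To make the rule-based idea concrete, here is a small sketch of my own (the path and channel names are hypothetical): position-based rules are just fixed weight vectors over a path.

```r
# Hypothetical three-touch path; rule-based weights depend only on position
path = c("Display", "TV", "Search")
n = length(path)

first_touch  = as.numeric(seq_len(n) == 1)  # all credit to the first touch
last_touch   = as.numeric(seq_len(n) == n)  # all credit to the last touch
linear_touch = rep(1 / n, n)                # credit split evenly

data.frame(channel = path, first_touch, last_touch, linear_touch)
```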
Markov Chain Attribution
Markov Chain Attribution is one of the more popular data-driven methods, and as the name suggests it takes advantage of Markov Chains. Unless you studied Operations Research or Math at university, you might not know what these chains are, so let’s do a high-level intro.
Markov Chains
The key concept to remember about Markov Chains is the Transition Matrix. Markov Chains themselves are a conceptual tool for describing a (stochastic) system. All this means is that when there is a random or probabilistic component to a process, Markov Chains might be a good tool to first describe, and then analyze, this process. The processes we are concerned with here are Customer Journeys towards Conversions.
The Fundamentals
We can visualize a Markov Chain as a directed network. Each node is a State that a consumer can occupy while following a path. Each arrow is a Transition, and with it there is an associated probability. So there is a predetermined set of touch-points you may have, but which one will happen next is uncertain.
library(visNetwork)
library(tidyverse)

nodes <- data.frame(id = 1:6,
                    label = c("Non-Conversion", "Display", "TV",
                              "Search", "Conversion", "Start"),
                    color = c("darkred", "grey", "grey",
                              "grey", "lightblue", "green"))
edges = expand.grid(from = 1:6, to = 1:6) %>%
  filter(from != to, to != 5, to != 1, from != 6)

visNetwork(nodes, edges, width = "100%") %>%
  visEdges(arrows = "from") %>%
  visIgraphLayout(layout = "layout_with_kk")
The Conversion and Non-Conversion nodes have arrows pointing to them from all channels, but no arrows emanating from them. These are called Absorbing States, and for us they indicate the end of a journey. When we collect all of the probabilities of transitioning from one state to the next, we organize them in a square Transition Matrix.
M = t(matrix(c(1,   0,   0,   0,   0,
               200, 0,   300, 350, 10,
               160, 350, 0,   300, 8,
               120, 200, 300, 0,   6,
               0,   0,   0,   0,   1), nrow = 5))
M = M / rowSums(M)
colnames(M) = c("Non-Conversion", "Display", "TV", "Search", "Conversion")
row.names(M) = colnames(M)

as.data.frame(round(M, 3), row.names = row.names(M)) %>%
  rownames_to_column() %>% gt::gt() %>%
  gt::tab_header(title = "Theoretical Transition Matrix")
| Theoretical Transition Matrix | |||||
|---|---|---|---|---|---|
| Non-Conversion | Display | TV | Search | Conversion | |
| Non-Conversion | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Display | 0.233 | 0.000 | 0.349 | 0.407 | 0.012 |
| TV | 0.196 | 0.428 | 0.000 | 0.367 | 0.010 |
| Search | 0.192 | 0.319 | 0.479 | 0.000 | 0.010 |
| Conversion | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Above is an example Transition Matrix which I just put together for illustration. To make it realistic, I set the conversion probabilities to relatively low values. Later we will use this matrix for Simulation!
An important property to keep in mind is that rows sum up to one, since we are dealing with probabilities. The rows represent the current state in the journey, and the columns represent the next state.
rowSums(M)
## Non-Conversion Display TV Search Conversion
## 1 1 1 1 1
There is one more important thing to know about Markov Chains, and that is the Markov Property, or memorylessness. This will be TMI for some, but it is really the reason why it is possible to summarize the whole process in one simple Transition Matrix. The property is a simplifying assumption stating that the probability of where you transition to next depends only on where you are now, not on your whole journey history. To be clear, this doesn’t mean that the journey history doesn’t matter. It absolutely does impact attribution results. What it doesn’t impact is the transition probabilities. This makes analysis significantly simpler from a mathematical point of view.
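As a small illustration using the matrix \(M\) defined above: because of the Markov Property, the next-step distribution is just a row of the Transition Matrix, and looking several steps ahead is just a matrix power.

```r
# The next-step distribution depends only on the current state,
# no matter how the journey arrived there:
M["TV", ]

# Probabilities of where we will be two steps from now are given by M squared:
M2 = M %*% M
M2["Display", "Conversion"]  # chance of having converted within two steps of Display
```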
From Chains to Paths (Simulation)
In practice, given a data-set of consumer paths we may estimate a Transition Matrix. The reverse is also true. Given a Transition Matrix we can simulate a set of consumer paths. In fact simulation plays an important role in deriving the actual levels of importance for each channel, so it is worth seeing the mechanics of this process.
Below we define a function that will work with the sample matrix \(M\) above. Given this matrix and a maximum number of Markov Chain steps num_sim, we do the following:
- Randomly select a starting point by sampling from the set of media touchpoints (in practice we would actually estimate the starting-point probabilities).
- For each step:
  - Pick the row of probabilities from \(M\) corresponding to the current state/touchpoint.
  - Take one sample from a Multinomial distribution with probabilities set to that row.
  - Set the sample as the new current state and repeat.
- If we hit a Conversion or Non-Conversion, we end the path.
simulate_path = function(M, num_sim){
  # Pre-allocate the path and pick a random starting channel (rows 2 to 4 of M)
  path = vector(mode = "integer", length = num_sim)
  path[1] = sample(2:(nrow(M) - 1), 1)
  for(i in 1:(num_sim - 1)){
    # Transition probabilities out of the current state
    p = M[path[i], ]
    # Draw the next state from a Multinomial distribution
    sn = which(rmultinom(1, 1, p) == 1)
    path[i + 1] = sn
    # Stop if we hit an absorbing state (Non-Conversion = 1, Conversion = 5)
    if(sn == 1 | sn == 5){
      break
    }
  }
  return(path[path > 0])
}
Simulating One Path
Let’s test out this function. The first example is a non-converter.
set.seed(124)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path ,
type = "l", axes=FALSE, ylim = c(1,5),
main = "Non-Converter",
ylab = "Channel Touchpoint",
xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))
(Figure: simulated path of a non-converter)
We see that we start out with Display, move to Search, then TV, carry on the journey until step 9 where we drop out without converting.
This is another example but this time we have a converter.
set.seed(007)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path ,
type = "l", axes=FALSE, ylim = c(1,5),
main = "Converter",
ylab = "Channel Touchpoint",
xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))
(Figure: simulated path of a converter)
This time we started out with TV, jumped around between Display and Search and converted on step 8 with Search being our last touchpoint.
Simulating a Full Dataset of Journeys
We can now simulate a whole set of paths simply by repeating the function call simulate_path. With a few more data wrangling operations we have a pathdf. Here I will generate 10,000 paths.
num_sim = 100
num_paths = 10000
paths = purrr::map(1:num_paths, ~simulate_path(M, num_sim))

conversion =
  purrr::map(paths, ~ data.frame(
    conversion = ifelse(.x[length(.x)] == 5, "converter", "non-converter")
  )) %>%
  bind_rows(.id = "path_num")

pathdf =
  map(paths, ~ data.frame(touchpoint = seq_along(.x), channel = .x)) %>%
  bind_rows(.id = "path_num") %>%
  left_join(conversion, by = "path_num") %>%
  left_join(data.frame(channel_name = colnames(M), channel = 1:5), by = "channel")
head(pathdf, 10) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Paths")
| Simulated Paths | ||||
|---|---|---|---|---|
| path_num | touchpoint | channel | conversion | channel_name |
| 1 | 1 | 4 | non-converter | Search |
| 1 | 2 | 2 | non-converter | Display |
| 1 | 3 | 1 | non-converter | Non-Conversion |
| 2 | 1 | 4 | non-converter | Search |
| 2 | 2 | 2 | non-converter | Display |
| 2 | 3 | 3 | non-converter | TV |
| 2 | 4 | 4 | non-converter | Search |
| 2 | 5 | 3 | non-converter | TV |
| 2 | 6 | 1 | non-converter | Non-Conversion |
| 3 | 1 | 3 | non-converter | TV |
Let’s now plot all of these paths on top of one another to see how this Markov Chain behaves.
plotly::ggplotly(
pathdf %>%
ggplot(aes(touchpoint, channel, color = conversion, group = path_num)) +
geom_line() +
labs(x = "Touchpoint Number", y = "Channel Touchpoint") +
theme_minimal()
)
From the plot we see that lines more often drop down to the bottom state, which indicates non-converters. This is as expected. We can also check the simulated conversion rate. It is in fact just under 5%.
table(conversion$conversion)/nrow(conversion)
##
## converter non-converter
## 0.0462 0.9538
From Paths to Chains
Now that we have simulated a set of consumer paths from our Markov Chain \(M\), we can try to use them to estimate the original \(M\) with \(\hat{M}\) and see how well the estimation method works. Here I will introduce the brilliant ChannelAttribution package authored by Davide Altomare. It is open source and hosted on GitLab here. This package contains a function called transition_matrix which takes in a data-set of paths and estimates the Transition Matrix.
Data Prep
The first order of business is to structure the data in the required format. First we need to strip out the final touchpoint in each path, since it represents the conversion/non-conversion event. Then we summarize the data, grouping by path and counting the number of converters and non-converters for each unique path observed.
named_paths =
pathdf %>% group_by(path_num) %>%
group_split() %>%
map(~pull(.x,channel_name))
path_trim = map(named_paths, ~.x[.x != "Non-Conversion" & .x != "Conversion"])
journeydf =
  tibble(path = map_chr(path_trim, ~str_c(.x, collapse = " > ")),
         conversion = conversion$conversion)
journeydf =
journeydf %>%
group_by(path) %>%
summarise(converters = sum(if_else(conversion == "converter", 1, 0)),
non_converters = sum(if_else(conversion == "non-converter", 1, 0))) %>%
arrange(-converters, -non_converters)
head(journeydf, 15) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Journey Data (Top 15 Converters)")
| Simulated Journey Data (Top 15 Converters) | ||
|---|---|---|
| path | converters | non_converters |
| Display | 41 | 824 |
| TV | 31 | 608 |
| Search | 23 | 626 |
| Search > TV | 17 | 330 |
| TV > Display | 15 | 319 |
| Search > Display | 12 | 256 |
| Display > Search | 10 | 237 |
| Display > TV | 10 | 232 |
| Display > Search > TV | 8 | 140 |
| TV > Search | 7 | 235 |
| Search > TV > Display | 7 | 155 |
| TV > Display > TV | 7 | 110 |
| TV > Search > Display | 7 | 92 |
| Display > TV > Display | 6 | 117 |
| Search > Display > TV | 6 | 67 |
Here are some quick summary stats about the generated path data.
(Table 2: summary statistics of the generated path data)
There are many other ways to explore this kind of data. One quick and interesting one is the distribution of path lengths.
path_lengths = map_int(path_trim, ~length(.x))
summary(path_lengths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.588 6.000 43.000
hist(path_lengths, main = "Path Length Distribution")
(Figure: histogram of the path length distribution)
Transition Matrix Estimation
Finally, now that we have our simulated data-set with a known transition matrix \(M\), we can see how to go from data back to a transition matrix via estimation, and compute \(\hat{M}\). Here we apply the ChannelAttribution::transition_matrix() function to our data. The function returns a lookup table for the channel ids and a table of transition probabilities.
library(ChannelAttribution)
tM = transition_matrix(journeydf,
var_path = "path",
var_conv = "converters",
var_null = "non_converters")
tM$transition_matrix %>% gt::gt() %>%
gt::tab_header(title = "Transition Probabilities")
| Transition Probabilities | ||
|---|---|---|
| channel_from | channel_to | transition_probability |
| (start) | 1 | 0.341000000 |
| (start) | 2 | 0.326500000 |
| (start) | 3 | 0.332500000 |
| 1 | (conversion) | 0.011514909 |
| 1 | (null) | 0.236687966 |
| 1 | 3 | 0.404885517 |
| 1 | 2 | 0.346911608 |
| 2 | (conversion) | 0.009719222 |
| 2 | (null) | 0.196480752 |
| 2 | 1 | 0.431393724 |
| 2 | 3 | 0.362406302 |
| 3 | (conversion) | 0.008998875 |
| 3 | (null) | 0.191159929 |
| 3 | 2 | 0.480711970 |
| 3 | 1 | 0.319129226 |
We can format the table and output it in a way that makes it easy to compare the estimated and the true transition matrices.
tM =
tM$transition_matrix %>%
left_join(tM$channels %>%
mutate(channel_from = as.character(id), from = channel_name) %>%
select(channel_from, from)) %>%
left_join(tM$channels %>%
mutate(channel_to = as.character(id), to = channel_name) %>%
select(channel_to, to)) %>%
mutate(from = if_else(is.na(from), channel_from, from),
to = if_else(is.na(to), channel_to, to)) %>%
select(-channel_from, -channel_to) %>%
pivot_wider(names_from = to, values_from = transition_probability) %>%
filter(from != "(start)") %>%
relocate(`(null)`, .before = Display) %>%
column_to_rownames("from")
tM[is.na(tM)] = 0
as.data.frame(round(tM,3), row.names = row.names(tM)) %>%
rownames_to_column() %>% gt::gt() %>%
gt::tab_header(title = "Estimated Transition Matrix")
| Estimated Transition Matrix | |||||
|---|---|---|---|---|---|
| (null) | Display | TV | Search | (conversion) | |
| Display | 0.237 | 0.000 | 0.347 | 0.405 | 0.012 |
| TV | 0.196 | 0.431 | 0.000 | 0.362 | 0.010 |
| Search | 0.191 | 0.319 | 0.481 | 0.000 | 0.009 |
as.data.frame(round(M[-c(1,5),],3), row.names = row.names(M[-c(1,5),])) %>%
rownames_to_column() %>% gt::gt() %>%
gt::tab_header(title = "Original Transition Matrix")
| Original Transition Matrix | |||||
|---|---|---|---|---|---|
| Non-Conversion | Display | TV | Search | Conversion | |
| Display | 0.233 | 0.000 | 0.349 | 0.407 | 0.012 |
| TV | 0.196 | 0.428 | 0.000 | 0.367 | 0.010 |
| Search | 0.192 | 0.319 | 0.479 | 0.000 | 0.010 |
Attribution
So far we introduced a theoretical Markov Chain in the form of a transition matrix \(M\), used it to simulate a dataset of consumer paths, and returned back to a transition matrix by estimating \(\hat{M}\). Now it’s time to move on to the actual attribution model!
Removal Effects
Markov Chains seem to be a good model for Consumer Journeys. But how to use them to actually attribute conversions is not immediately clear. The main idea here is to compute what are called Removal Effects. In essence, we reverse the original question of how many conversions a channel brings by asking how many fewer conversions we would have if a channel were omitted from the plan. This quantity is the Removal Effect. If we compute this for each channel in the Markov Chain, then the relative Removal Effects become the weights by which we allocate the total conversion volume.
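To make the mechanics concrete, here is a sketch of my own (conv_prob and the uniform starting assumption are my illustration, not the ChannelAttribution internals) that computes Removal Effects directly from the theoretical matrix \(M\), using the standard absorbing-chain formula \(B = (I - Q)^{-1}R\). Removing a channel is modeled by redirecting all traffic into that channel towards Non-Conversion.

```r
# Probability of ending in "Conversion", averaged over starting channels,
# via the absorbing Markov chain formula B = (I - Q)^-1 R
conv_prob = function(M, channels){
  Q = M[channels, channels, drop = FALSE]  # transient-to-transient transitions
  R = M[channels, c(1, 5), drop = FALSE]   # transitions into absorbing states
  B = solve(diag(length(channels)) - Q) %*% R
  mean(B[, 2])                             # column 2 is "Conversion"
}

base = conv_prob(M, channels = 2:4)

# Removal effect: relative drop in conversion probability when a channel
# is removed (all traffic into it goes to Non-Conversion instead)
removal_effect = sapply(2:4, function(ch){
  M2 = M
  M2[, 1] = M2[, 1] + M2[, ch]  # redirect inflow to Non-Conversion
  M2[, ch] = 0
  1 - conv_prob(M2, channels = setdiff(2:4, ch)) / base
})

# Conversions are then attributed proportionally to the removal effects
sum(journeydf$converters) * removal_effect / sum(removal_effect)
```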
Running Attribution
As we just learned, running a Markov attribution model involves estimating a transition matrix and then computing the removal effects. To do this we simply use the markov_model function, which takes the data in the prepared format. By specifying out_more = TRUE, we also get back the transition matrix and removal effects in addition to the attribution results.
mm_res =
markov_model(journeydf,
var_path = "path",
var_conv = "converters",
var_null = "non_converters",
out_more = TRUE)
##
## Number of simulations: 100000 - Convergence reached: 2.65% < 5.00%
##
## Percentage of simulated paths that successfully end before maximum number of steps (44) is reached: 99.99%
Attribution Results
First let’s take a look at the results. The below table shows how the total conversions were attributed by our Data-Driven model.
mm_res$result %>% gt::gt()
| channel_name | total_conversions |
|---|---|
| Display | 153.6905 |
| TV | 156.9407 |
| Search | 151.3688 |
As mentioned before, these values are computed using the removal effects. Here is the explicit code that scales the total conversions by the relative removal effects.
sum(journeydf$converters) *
  mm_res$removal_effects$removal_effects /
  sum(mm_res$removal_effects$removal_effects)
## [1] 153.6905 156.9407 151.3688
Quick Comparison to Rule-Based Models
heuristic_models(journeydf,
                 var_path = "path",
                 var_conv = "converters") %>%
  gt::gt()
| channel_name | first_touch | last_touch | linear_touch |
|---|---|---|---|
| Display | 154 | 173 | 160.6948 |
| TV | 147 | 153 | 157.8124 |
| Search | 161 | 136 | 143.4928 |